In this file, we begin to consider relationships between variables. We have two basic types of questions that we’re interested in:
Given the negative values in the disbursal_timediff variable, however, we discussed cutting the supply chain in two and examining the “posting-to-funding” component separately from the “disbursing-to-repaying” component.
We also generated some data partitioning ideas at the end of our single-variable EDA, namely:
Let’s load the data and get started.
library(jsonlite)
library(ggplot2)
library(psych)
library(RColorBrewer)
library(countrycode)
loans = jsonlite::fromJSON("newpaid.json")
loans$posted_datetime = strptime(loans$posted_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$funded_datetime = strptime(loans$funded_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$disbursal_datetime = strptime(loans$disbursal_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$paid_datetime = strptime(loans$paid_datetime, format = "%Y-%m-%d %H:%M:%S")
We know that we have no missing data, so we’ll dive right in.
First, we’ll generate a scatterplot matrix for the numeric loan feature variables:
pairs.panels(loans[, c(8, 9, 17, 18)])
Most pairs of variables are not correlated. There is no relationship between the length of a loan’s description and its amount, the number of borrowers, or the gender composition of its borrowers, nor is the gender composition of the borrowers related to the amount or number of borrowers. (The apparent structure in the number-female plot is merely an artifact of integer denominators.) The only correlation is between the loan amount and the number of borrowers. Although many loans, regardless of amount, have a single borrower – perhaps a representative for a larger group – the general trend is that a larger loan means more borrowers.
We see a few high-leverage points, where the loan amount is greater than 5000 USD, but these do not seem to be throwing off the overall trends.
Next, we want to consider categorical features of the loan, country and sector, in relation to our other loan variables. Let’s look at sector and loan amount first:
ggplot(data=loans, aes(x=sector, y=terms.loan_amount, colour=sector)) +
geom_boxplot() + scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Loan amount") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
For the bulk of the loans, there isn’t much difference between sectors: the mass is concentrated below about 1500 USD. Sector distinctions seem to matter more when we look at the tails. The education, food, manufacturing, and service sectors all have multiple loans that extend well beyond 1500 USD. We can examine this segment of loans values by itself:
large_loans = loans[loans$terms.loan_amount >= 1500, ]
ggplot(data=large_loans, aes(x=sector, y=terms.loan_amount, colour=sector)) +
geom_boxplot() + scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Loan amount") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Even for this segment, though, the median and IQRs by sector are comparable. Perhaps segmenting by loan amount won’t prove interesting after all.
How does the sector relate to the number of borrowers? Loans with one borrower will dominate the plot, so let’s look only at loans with more than one borrowers:
ggplot(data=loans[loans$b.num>1, ], aes(x=sector, y=b.num, colour=sector)) +
geom_boxplot() + scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Number of borrowers") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Again, the ranges are quite comparable. Entertainment sector is the only real stand-out.
Is there a relationship between gender and sector?
ggplot(data=loans, aes(x=sector, y=b.female, colour=sector)) +
geom_boxplot() + scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Fraction of females in borrowing group") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Yes, there are certain sectors where the borrowers are almost exclusive female: arts, clothing, food, and retail. This is the exception, however. All in all, sector doesn’t seem to reveal much when it comes to the characteristics of a loan. We can quantify its effect on the loan amount:
summary.lm(aov(terms.loan_amount ~ sector, loans))
##
## Call:
## aov(formula = terms.loan_amount ~ sector, data = loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -862.9 -430.4 -192.6 187.1 9282.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 716.6243 13.7199 52.232 < 2e-16 ***
## sectorArts 95.4826 42.6293 2.240 0.02512 *
## sectorClothing 152.0449 27.3214 5.565 2.67e-08 ***
## sectorConstruction 98.5481 44.4358 2.218 0.02659 *
## sectorEducation 163.6535 77.0980 2.123 0.03380 *
## sectorEntertainment 133.3757 180.4577 0.739 0.45986
## sectorFood 0.9781 18.1311 0.054 0.95698
## sectorHealth 182.1045 67.6632 2.691 0.00713 **
## sectorHousing -8.4243 39.6187 -0.213 0.83162
## sectorManufacturing 98.8635 52.1076 1.897 0.05781 .
## sectorPersonal Use 46.9174 74.7286 0.628 0.53012
## sectorRetail 17.2302 18.6083 0.926 0.35449
## sectorServices 171.2642 25.7246 6.658 2.89e-11 ***
## sectorTransportation -36.1917 35.0792 -1.032 0.30222
## sectorWholesale 31.9051 124.1948 0.257 0.79726
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 719.7 on 13763 degrees of freedom
## Multiple R-squared: 0.007283, Adjusted R-squared: 0.006273
## F-statistic: 7.212 on 14 and 13763 DF, p-value: 3.571e-15
The very low adjusted R-squared means a negligible effect size.
What about country? Does it help explain the characteristics of a loan? In the plot below, countries are colored by continent.
# helper function
get_continent = function(df){
countrycode(df$location.country_code, origin="iso2c", destination="continent")
}
ggplot(data=loans, aes(x=location.country_code, y=terms.loan_amount,
fill=get_continent(loans))) + geom_boxplot() +
xlab("Country") + ylab("Loan amount") +
scale_fill_discrete(name="Continent") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Here, we see a significant variance in loan amount, within and across countries. Some countries have a mean loan amount of over 3000 USD (the U.S., notably), even as many have a mean of under 1000 USD. Even for these latter countries, however, there can be a large number of loans in the upper quartile. (These are the dots lying outside the boxplots.) African countries and countries from the Americas tend to have a lower median loan amount than do Asian countries. The lowest Asian country is Iraq; the sole European country is Ukraine, and the country from Oceania is Samoa.
Let’s remove countries that don’t show up much in our sample before analyzing how country relates to other loans variables.
high_countries = table(loans$location.country_code) >= 30
hc_vec = high_countries[loans$location.country_code] == TRUE
hc_loans = loans[hc_vec, ]
An ANOVA shows that country does a better job of explaining the variance in loan amount:
summary.lm(aov(terms.loan_amount ~ location.country_code, hc_loans))
##
## Call:
## aov(formula = terms.loan_amount ~ location.country_code, data = hc_loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1651.1 -269.8 -97.5 151.5 4348.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 704.78 70.69 9.970 < 2e-16 ***
## location.country_codeAM 1246.28 110.58 11.271 < 2e-16 ***
## location.country_codeAZ 531.89 79.50 6.691 2.31e-11 ***
## location.country_codeBJ -213.33 89.32 -2.388 0.016936 *
## location.country_codeBO 563.67 76.85 7.335 2.35e-13 ***
## location.country_codeCM -153.35 121.27 -1.265 0.206055
## location.country_codeCO -13.18 88.62 -0.149 0.881743
## location.country_codeCR 231.38 105.19 2.200 0.027852 *
## location.country_codeDO 870.02 87.72 9.919 < 2e-16 ***
## location.country_codeEC 115.05 76.80 1.498 0.134164
## location.country_codeGH -135.53 76.17 -1.779 0.075237 .
## location.country_codeGT 288.79 99.26 2.910 0.003625 **
## location.country_codeHN -123.10 89.03 -1.383 0.166801
## location.country_codeID -187.85 92.40 -2.033 0.042060 *
## location.country_codeIQ 1545.22 118.07 13.088 < 2e-16 ***
## location.country_codeKE -231.28 73.51 -3.146 0.001659 **
## location.country_codeKG 626.35 90.57 6.916 4.87e-12 ***
## location.country_codeKH 64.92 72.92 0.890 0.373343
## location.country_codeLB 549.71 81.01 6.785 1.21e-11 ***
## location.country_codeLR -260.60 98.25 -2.653 0.007998 **
## location.country_codeML 132.63 86.07 1.541 0.123327
## location.country_codeMN 591.03 81.99 7.209 5.95e-13 ***
## location.country_codeMX 207.07 76.45 2.708 0.006767 **
## location.country_codeMZ -190.80 103.72 -1.840 0.065846 .
## location.country_codeNG -95.35 77.40 -1.232 0.218010
## location.country_codeNI -157.57 73.74 -2.137 0.032639 *
## location.country_codeNP -468.35 121.27 -3.862 0.000113 ***
## location.country_codePE -77.97 72.11 -1.081 0.279603
## location.country_codePH -403.84 71.89 -5.618 1.97e-08 ***
## location.country_codePK 182.14 82.10 2.219 0.026528 *
## location.country_codePS 1165.03 106.81 10.907 < 2e-16 ***
## location.country_codePY 668.48 81.06 8.247 < 2e-16 ***
## location.country_codeQS -332.23 84.96 -3.910 9.26e-05 ***
## location.country_codeRW 177.18 84.55 2.096 0.036127 *
## location.country_codeSL 126.58 84.00 1.507 0.131866
## location.country_codeSN 338.81 82.32 4.116 3.88e-05 ***
## location.country_codeSV -46.79 80.05 -0.585 0.558850
## location.country_codeTG 95.73 79.94 1.198 0.231093
## location.country_codeTJ 236.75 75.35 3.142 0.001682 **
## location.country_codeTZ 56.71 78.67 0.721 0.471017
## location.country_codeUA 541.20 94.36 5.736 9.92e-09 ***
## location.country_codeUG 413.66 75.53 5.477 4.42e-08 ***
## location.country_codeVN 80.09 82.10 0.976 0.329289
## location.country_codeWS -81.76 84.08 -0.972 0.330854
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 582.9 on 13501 degrees of freedom
## Multiple R-squared: 0.2405, Adjusted R-squared: 0.2381
## F-statistic: 99.44 on 43 and 13501 DF, p-value: < 2.2e-16
The adjusted R-squared is 0.23, much better than the same statistic for sector. (If we include all countries, it climbs to 0.32.)
How does country relate to number of borrowers?
ggplot(data=hc_loans, aes(x=location.country_code, y=b.num,
fill=get_continent(hc_loans))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Number of borrowers") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, loans with one borrower dominate, yet there are some countries where the number of borrowers tends to range more widely, such as Paraguay (PY).
Does country relate to the gender composition of the borrowers?
ggplot(data=hc_loans, aes(x=location.country_code, y=b.female,
fill=get_continent(hc_loans))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Fraction of females in borrower group") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
In many countries, the loans are domainated by female borrower groups, but in many others, there are male groups as well. The most interesting feature of the above plot is probably the countries where there are a large number of mixed-gender borrower groups: Cambodia (KH), Peru (PE), The Philippines (PH), and Tanzania (TZ).
Finally, we can look at how our two categorical variables, country and sector, relate to one another. In our univariate EDA, we saw that agriculture, food, and retail were the main sector represented in the loan data. Is this true for each country?
getPalette = colorRampPalette(brewer.pal(11, "Spectral"))
num_sectors = length(unique(hc_loans$sector))
ggplot(data=hc_loans, aes(x=location.country_code, fill=sector)) +
geom_histogram(position="fill") +
scale_fill_manual(values = getPalette(num_sectors)) +
xlab("Country") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Not exactly. Those are the three largest sectors, but their relative frequencies differ from country to country. In some countries, agriculture accounts for well over 50% of loans, or even 100% of loans; in others, agriculture loans are outnumbered by food and retail loans.
The number of sectors can make it hard to be sure of these frequencies, so let’s look at the same plot but only for these top three sectors:
top_sectors = c("Agriculture", "Food", "Retail")
top_sector_loans = hc_loans[hc_loans$sector %in% top_sectors, ]
ggplot(data=top_sector_loans, aes(x=location.country_code, fill=sector)) +
geom_histogram(position="fill") +
scale_fill_manual(values = getPalette(3)) +
xlab("Country") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))
Now the distinctions are clear. This suggests that when we move on to multivariate analysis, we try conditioning on the borrower’s country before examining how sector relates to loan properties.
We have not yet looked at time series. Let’s do that now. We note that loans are not evenly spaced in time, so any time series analysis would quickly land us in complex territory. However, we can get a sense of the overall activity by looking the loan amount at the posted_datetime variable:
ggplot(data=loans, aes(x=posted_datetime, y=terms.loan_amount)) + geom_line() +
xlab("Time") + ylab("Loan Amount") + theme_bw()
The sporadic early activity suggests we not use dates early than 2007. Let’s narrow the window and use a smoothing function to get a better sense of the trend:
loans07 = loans[loans$posted_datetime > strptime("2007-01-01", format="%Y-%m-%d"), ]
ggplot(data=loans07, aes(x=posted_datetime, y=terms.loan_amount)) +
geom_smooth(lwd=2, level=0.99) +
xlab("Time") + ylab("Loan Amount") + theme_bw()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Total loan amount is growing, as we would expect with a growing business. How do the number of borrowers change over time?
ggplot(data=loans07, aes(x=posted_datetime, y=b.num)) +
geom_smooth(lwd=2, level=0.99) +
xlab("Time") + ylab("Number of borrowers") + theme_bw()
Generally, this value goes up. And what about the gender composition of the borrowers?
ggplot(data=loans07, aes(x=posted_datetime, y=b.female)) +
geom_smooth(lwd=2, level=0.99) +
xlab("Time") + ylab("Fraction of females in borrower group") + theme_bw()
This value climbs for the first year – meaning that, on average, a greater fraction of people in the borrowing group are female – and then dips for the next three years.
How do loan amounts change over time for the top three sectors?
ggplot(data=loans07[loans07$sector %in% top_sectors, ],
aes(x=posted_datetime, y=terms.loan_amount, colour=sector)) +
geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Loan Amount") +
scale_colour_discrete(name="Sector") +
theme_bw()
And finally, we ask: How do loan amounts change over time by country? To avoid cluttering the plot, we can look at only the top five countries, in terms of how many loans they received:
top_countries = sort(table(loans07$location.country_code), decreasing=T)[1:5]
ggplot(data=loans07[loans07$location.country_code %in% names(top_countries), ],
aes(x=posted_datetime, y=terms.loan_amount, colour=location.country_code)) +
geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Loan Amount") +
scale_colour_discrete(name="Country") + theme_bw()
Kenya had a spike in 2008 in which they took higher-value loans, while the per-loan value for other countries has been increasingly more stably – Nicaragua and Cambodia, in particular.
Let’s revisit our scatterplot matrix, only now we will add “pipeline features” – that is, the timediffs that represent the difference in days between the different stages in the loan pipeline. We can also add in the number of lenders.
pairs.panels(loans[, c(8, 9, 17, 18, 5, 14, 15, 16)])
Recall that we are separating the funding component of the pipeline from the repayment component. That noted, we observe that the loan amount and the funding time are correlated, albeit weakly; this is not surprising. Also not surprising is the correlation between loan amount and lender count, although the strength of it suggests that the individual contribution per loan is effectively fixed. What is somewhat more surprising is the correlation (again, weak) between funding time and gender. This correlation is negative, meaning that as the fraction of females in a borrowing group increases, the funding time tends to drop. There are similar trends, though even weaker, between the same variables and the repayment time.
We can confirm that there is a significant difference in the all-male and not-all-male populations when it comes to funding time using a t-test:
t.test(loans$funded_timediff ~ loans$b.fem>0)
##
## Welch Two Sample t-test
##
## data: loans$funded_timediff by loans$b.fem > 0
## t = 21.8486, df = 3817.287, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.715813 4.448433
## sample estimates:
## mean in group FALSE mean in group TRUE
## 7.136061 3.053939
There is definitely a difference in mean funding time: 4.08 ± 0.37 days. Given that the mean for all-male borrowin groups is 7.14 days, this effect size is rather large. Recall that the median funding time is quite low: 0.71 days. (The mean is 3.99 days.) Since high-value loans tend to drive up the funding time, and since males tend to be associated with higher loan amount (albeit quite weakly), we wonder whether this effect holds when we remove larger loans from consideration. Taking out the upper quartile:
normal_loans = loans[loans$terms.loan_amount <= quantile(loans$terms.loan_amount)[4], ]
t.test(normal_loans$funded_timediff ~ normal_loans$b.fem>0)
##
## Welch Two Sample t-test
##
## data: normal_loans$funded_timediff by normal_loans$b.fem > 0
## t = 16.486, df = 2596.384, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.837099 3.603109
## sample estimates:
## mean in group FALSE mean in group TRUE
## 5.554105 2.334001
It does. The difference in means is now closer to 3 days, but the effect is real.
Moving on: The number of borrowers and the length of the description have no apparent effect on how quickly a loan is funded (or repaid). Lender count, meanwhile, is correlated with variables that one might expect: the loan amount, the number of borrowers, and the funding time. It is not correlated with the gender composition of the borrowing group. We can quickly see, too, that the dollar per lender is completely uncorrelated with gender:
cor(loans$b.female, loans$terms.loan_amount/loans$lender_count)
## [1] -0.01249957
If gender does have an impact on lenders’ behavior, it is not in how much they give, but rather on whether they give – and the more people that do give, the more quickly the loan is funded.
Now we ask how sector and country relate to pipeline variables. We’ll begin with sector and funding time:
ggplot(data=loans, aes(x=sector, y=funded_timediff,
colour=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Funding time") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
The median funding time for all sector is quite low, with slightly larger IQRs for sectors that one might expect: housing and transportation. The length of the each sector’s tail is the main distinguishing feature.
Does the number of lenders vary by sector?
ggplot(data=loans, aes(x=sector, y=lender_count,
colour=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Number of lenders") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Not really. As we saw before, sector and loan amount aren’t very related, and since loan amount is reasonably correlated with the number of lenders, the lack of a relationship here isn’t surprising.
Does sector help explain any variance in repayment time?
ggplot(data=loans, aes(x=sector, y=paid_timediff,
colour=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Repayment time") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, the answer is not really.
We move on to examining the relationship of the borrower’s country to the funding time.
hc_loans_0 = hc_loans[hc_loans$funded_timediff >= 0, ]
ggplot(data=hc_loans_0,
aes(x=location.country_code, y=funded_timediff,
fill=get_continent(hc_loans_0))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Funding time") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
There are a few countries whose median value is much higher than the global median, which is a mere 0.71 days. These counties include:
long_fund = sort(tapply(hc_loans_0$funded_timediff,
hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(long_fund), origin="iso2c", destination="country.name")
## [1] "Iraq" "Palestine, State of"
## [3] "Lebanon" "Armenia"
## [5] "Colombia" "El Salvador"
## [7] "Azerbaijan" "Bolivia, Plurinational State of"
## [9] "Kyrgyzstan" "Tajikistan"
How does the number of lenders relate to the borrowing country?
ggplot(data=hc_loans, aes(x=location.country_code, y=lender_count,
fill=get_continent(hc_loans))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Number of lenders") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, we see a fair amount of variance in the medians. Which countries require the most lenders?
high_lend_ct = sort(tapply(hc_loans_0$lender_count,
hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(high_lend_ct), origin="iso2c", destination="country.name")
## [1] "Iraq" "Armenia" "Palestine, State of"
## [4] "Kyrgyzstan" "Dominican Republic" "Mongolia"
## [7] "Lebanon" "Ukraine" "Costa Rica"
## [10] "Azerbaijan"
Some of these names appeared in the previous list, but others are new. If we look at the median loan amount by country, we can get a sense for why (some of) these countries require more lenders and more funding time:
high_loan_amt = sort(tapply(hc_loans_0$terms.loan_amount,
hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(high_loan_amt), origin="iso2c", destination="country.name")
## [1] "Iraq" "Palestine, State of" "Armenia"
## [4] "Dominican Republic" "Azerbaijan" "Lebanon"
## [7] "Kyrgyzstan" "Mongolia" "Ukraine"
## [10] "Costa Rica"
They have some of the biggest loans. (There are countries that have even larger loans, but fewer loans in number, such as the United States.)
Finally, we examine repayment time by country.
ggplot(data=hc_loans, aes(x=location.country_code, y=paid_timediff,
fill=get_continent(hc_loans))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Repayment time") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Here, the disparities are quite significant. If we wish to predict how long it will take a borrower to pay back a loan, that borrower’s country could certainly be helpful.
How do these new pipeline variables relate to the calendar time? We can check.
ggplot(data=loans07,
aes(x=posted_datetime, y=funded_timediff)) +
geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Funding time") +
theme_bw()
Funding time has been slowly increasing – but not by much. The maximum toward the end of record keeping could stand some investigation.
Do we see a corresponding increase in the repayment time?
ggplot(data=loans07,
aes(x=posted_datetime, y=paid_timediff)) +
geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Repayment time") +
scale_colour_discrete(name="Sector") +
theme_bw()
From mid-2009 on, the answer is yes: repayment time is increasing. However, there is a different unexplained maximum toward the start of recording keeping.
We have seen that there is a weak correlation between the loan amount and certain pipeline features, namely how long it takes a loan to get funded and repaid, and a strong correlation between the loan amount and the lender count. It is natural at this point to create certain rate variables and see whether more subtle patterns can be detected. We’ll consider a funding rate (in USD per day), a repayment rate (in USD per day), a lender rate (in USD per lender), and a borrower rate (in USD per borrower). These variables can help us to discover, for example, whether certain types of loans tend to be associated with a greater individual contribution or a slower repayment rate.
loans$f.rate = loans$terms.loan_amount / loans$funded_timediff
loans$l.rate = loans$terms.loan_amount / loans$lender_count
loans$r.rate = loans$terms.loan_amount / loans$paid_timediff
loans$b.rate = loans$terms.loan_amount / loans$b.num
We can get some summary statistics for each of our new features:
summary(loans$f.rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.28 173.40 772.00 2583.00 2745.00 181800.00
The median funding rate is 754 USD per day; given that the median loan amount is 550 USD, the median funding time of about 0.71 day is explainable. What’s going on with that extreme value?
loans[which.max(loans$f.rate), ]
## id status sector partner_id lender_count
## 12292 87088 paid Agriculture 48 1
## description.languages location.country_code terms.loan_amount
## 12292 es, en BO 600
## description.avgchar posted_datetime funded_datetime
## 12292 794 2009-02-04 12:10:08 2009-02-04 12:14:55
## paid_datetime disbursal_datetime funded_timediff
## 12292 2010-02-15 11:56:36 2009-01-21 08:00:00 0.0033
## disbursed_timediff paid_timediff b.num b.female f.rate l.rate
## 12292 -14.177 390.1643 1 0 181818.2 600
## r.rate b.rate
## 12292 1.537814 600
Here, we have a modest loan (600 USD) that got fully funded by one lender four minutes after it was posted. This observation reminds us that rates can be unstable.
The median lender rate, meanwhile, is about 30 USD per lender per loan, with an (enforced) minimum of 25 USD.
summary(loans$l.rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 27.11 30.30 39.05 36.25 1262.00
loans[which.max(loans$l.rate), ]
## id status sector partner_id lender_count description.languages
## 8820 286467 paid Retail 30 2 ru, en
## location.country_code terms.loan_amount description.avgchar
## 8820 AZ 2525 414
## posted_datetime funded_datetime paid_datetime
## 8820 2011-03-29 15:50:03 2011-03-31 22:08:57 2012-08-15 09:59:50
## disbursal_datetime funded_timediff disbursed_timediff paid_timediff
## 8820 2011-02-28 08:00:00 2.2631 -31.5479 534.0416
## b.num b.female f.rate l.rate r.rate b.rate
## 8820 1 0 1115.726 1262.5 4.728096 2525
The maximum lender rate was for one loan of 2525 USD that was funded by two lenders.
What about the borrower rate?
summary(loans$b.rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.167 250.000 425.000 560.600 750.000 10000.000
This distribution ranges much more widely than the lender rate, and we can see from the size of the fourth quartile that it is very positively skewed.
Lastly, we look at the repayment rate.
summary(loans$r.rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03846 1.15400 1.91700 2.97400 3.12400 52.86000
Some repayment times are negative, as we saw in our univariate analysis; this account for the negative values. Aside from that, we see another positively skewed distribution, with a modest median of 1.92 USD per day.
We can re-examine our correlation matrix with our rates:
pairs.panels(loans[, c(8, 17, 18, 14, 16, 19:22)])
On the lender side: The meager correlations we saw earlier between (numerical) loan features and pipeline features vanish when we look at rates. Neither the gender composition of the borrowing group nor the number of borrowers appears related to how quickly a loan is funded or how much each individual gives. The loan amount is also uncorrelated with these rates. (Note that if we remove loans that fall in the unstable upper quartile of funding rates, the correlation between gender and funding rate reappears, weakly, at r = 0.16.) One new relationship that we do see is the weak positive correlation between borrower rate and funding time, but when we compare borrower rate to funding rate, there is no correlation. The likely explanation here is that a higher borrower rate (more money per borrower) tends to mean a higher loan amount and thus a longer funding time; but since there is no relationship between loan amount and funding rate, the effect disappears when we consider rates.
These vanishing correlations suggest that if we wish to discover something interesting about how loan characteristics relate to lender behavior, we examine a subset of the loans – perhaps higher-value loans. As it stands, the value of most loans is small enough relative to the minimum lending amount that they admit a wide variance in lender behavior. Before trying that idea out, we’ll look at how country and sector relate to rates, however.
On the borrower side: We do notice some strong correlations between repayment rate and loan amount, as well as between repayment rate and number of borrowers. (Loan amount and number of borrowers are themselves strongly correlated, of course.) So larger loans tend to be repaid at a greater rate. There is also a weak negative correlation between borrower rate and gender composition: as the fraction of females in a borrowing group increases, the per-borrower amount tends to decrease. Finally, we note the moderate correlation between the borrower rate and the loan amount. This correlation seems at odds with some other correlations. As the loan amount increases, how can both the number of borrowers and the loan amount per borrower increase? They could, of course, both increase, if each borrower were taking on absolute responsibility, but what seems to be happening here is that there are two sub-populations: multi-borrower groups and single-borrower groups. The former is driving the line of best fit in the plot of borrower number vs. loan amount, while the latter is driving the line of best fit in the plot of borrower rate vs. loan amount. We should consider splitting out these two populations in the future.
Let’s now examine how country and sector relate to these rates.
We’ll begin with country and funding rate.
hc_loans$f.rate = hc_loans$terms.loan_amount / hc_loans$funded_timediff
hc_loans$l.rate = hc_loans$terms.loan_amount / hc_loans$lender_count
hc_loans$r.rate = hc_loans$terms.loan_amount / hc_loans$paid_timediff
hc_loans$b.rate = hc_loans$terms.loan_amount / hc_loans$b.num
ggplot(data=hc_loans, aes(x=location.country_code, y=f.rate,
fill=get_continent(hc_loans))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Funding rate") + ylim(c(0,3000)) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
The medians differ by as much as 1000 USD, but the ranges of the IQRs overlap more often than not, suggesting that the effect is due to chance.
What happens when we look at lender rate?
ggplot(data=hc_loans, aes(x=location.country_code, y=l.rate,
fill=get_continent(hc_loans))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Lender rate") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, low-value loans and 25 USD contributions dominate, so the individual contribution rate doesn’t differ much on the basis of the borrower’s country.
We move on to borrower rate by country:
ggplot(data=hc_loans, aes(x=location.country_code, y=b.rate,
fill=get_continent(hc_loans))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Borrower rate") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
We see more variance here. A lot of Asian countries, in particular, have high per-borrower loan amounts.
Next, we look at repayment rate by country:
ggplot(data=hc_loans, aes(x=location.country_code, y=r.rate,
fill=get_continent(hc_loans))) + geom_boxplot() +
scale_fill_discrete(name="Continent") +
xlab("Country") + ylab("Repayment rate") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, there is a fair amount of variance.
For each of these rates, we can quantify how much variance is explained by the borrower’s country:
summary.lm(aov(l.rate ~ location.country_code, hc_loans))$adj.r.squared
## [1] 0.00711905
summary.lm(aov(f.rate ~ location.country_code, hc_loans))$adj.r.squared
## [1] 0.009379178
summary.lm(aov(b.rate ~ location.country_code, hc_loans))$adj.r.squared
## [1] 0.4242269
summary.lm(aov(r.rate ~ location.country_code, hc_loans))$adj.r.squared
## [1] 0.1549973
So a borrower’s country does not seem to have much of a relationship to lender rates (0.007) or funding rates (0.01) – that is, how much each individual lenders give and how much funding per day a loan receives, respectively – but it does have a moderate relationship to borrower rates (0.42) and a weak relationship to repayment rates (0.16) – that is, the value of the loan per borrower and how much of a loan is repaid per day, respectively.
Does sector have an impact of either of our rates? Again, we begin with the funding rate, excluding the upper quartile:
ggplot(data=loans, aes(x=sector, y=f.rate,
color=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Funding rate") + ylim(c(0,3000)) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Some sectors (e.g. entertainment, housing, personal use) definitely get funded more slowly than do others (e.g. arts, education, health).
Does sector have any impact on the lender rate? We know that 25 USD contributions will dominate, so let’s look only at loans that had lender rates above 50 USD per lender.
ggplot(data=loans, aes(x=sector, y=l.rate,
color=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Lending rate") + ylim(c(50, 1000)) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
The medians are all comparable, though we can see that the most popular sectors – agriculture, food, and retail – do inevitably attract some generous lenders. We also notice that no entertainment loans averaged 50 USD per lender o more.
Does sector explain any of the variance in borrower rate?
ggplot(data=loans, aes(x=sector, y=b.rate,
color=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Borrower rate") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
It doesn’t seem to. What about repayment rate?
ggplot(data=loans, aes(x=sector, y=r.rate,
color=sector)) + geom_boxplot() +
scale_colour_discrete(guide=FALSE) +
xlab("Sector") + ylab("Repayment rate") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))
Again, sector doesn’t seem useful. We get numerical estimates:
summary.lm(aov(l.rate ~ sector, loans))$adj.r.squared
## [1] 0.0003695259
summary.lm(aov(f.rate ~ sector, loans))$adj.r.squared
## [1] 0.008509044
summary.lm(aov(b.rate ~ sector, loans))$adj.r.squared
## [1] 0.014931
summary.lm(aov(r.rate ~ sector, loans))$adj.r.squared
## [1] 0.01053561
All are effectively zero.
start_datetime = strptime("2007-01-01", format="%Y-%m-%d")
start_datetime = as.POSIXct("2007-01-01T00:00:00Z",
format = "%Y-%m-%dT%H:%M:%SZ")
loans$disbursal_start_time = difftime(loans$disbursal_datetime,
start_datetime, units="days")
pairs.panels(loans[, c(8, 17, 18, 14, 16, 19:22, 23)])
The only real correlation here is with funded_timediff – which would indicate that as the calendar time wore on, it took slightly longer for the average loan to get funded. Nothing else stands out.
We have done a fairly thorough sweep of the pairwise relationship between our variables, including some new rate variables that we created. From here, we could go on to explore multiple variables at once, or we could consider transforming or segmenting variables and reexamining pairwise combinations. Given the number of directions we could move in, perhaps it would be wise to step back and ask what it is we’re even trying to learn or predict – beyond mere “data surprises.”
But first, here is what we have learned:
We also encountered some curiosities in the data that we should address before moving on. Specifically:
head(hc_loans)
## id status sector partner_id lender_count
## 1 334113 paid Food 183 42
## 2 313179 paid Transportation 145 24
## 3 206002 paid Retail 145 15
## 4 354610 paid Education 113 15
## 5 226099 paid Agriculture 109 10
## 6 78960 paid Food 70 85
## description.languages location.country_code terms.loan_amount
## 1 en SL 1175
## 2 en PH 600
## 3 en PH 425
## 4 es, en GT 400
## 5 en KH 500
## 6 en PE 2725
## description.avgchar posted_datetime funded_datetime
## 1 690 2011-09-11 00:40:06 2011-09-29 01:06:26
## 2 524 2011-07-04 15:20:03 2011-07-08 06:18:03
## 3 1637 2010-07-01 07:00:16 2010-07-15 17:11:28
## 4 472 2011-11-05 23:50:06 2011-11-06 02:52:45
## 5 731 2010-10-01 07:00:09 2010-10-02 16:54:39
## 6 1890 2008-12-12 16:30:14 2008-12-16 15:52:58
## paid_datetime disbursal_datetime funded_timediff
## 1 2012-09-15 09:19:24 2011-08-15 07:00:00 18.0183
## 2 2012-01-15 11:01:44 2011-06-13 07:00:00 3.6236
## 3 2010-12-15 15:08:24 2010-05-07 07:00:00 14.4244
## 4 2012-05-15 08:52:45 2011-10-10 07:00:00 0.1685
## 5 2011-05-15 10:07:34 2010-07-28 07:00:00 1.4128
## 6 2009-06-15 22:51:05 2008-11-30 08:00:00 3.9741
## disbursed_timediff paid_timediff b.num b.female f.rate l.rate
## 1 -44.7545 397.0968 1 0 65.21148 27.97619
## 2 -24.9709 216.2095 1 1 165.58119 25.00000
## 3 -69.4246 222.3808 1 1 29.46396 28.33333
## 4 -26.8700 218.0783 1 1 2373.88724 26.66667
## 5 -66.4130 291.1303 1 1 353.90713 50.00000
## 6 -16.3284 197.5771 13 1 685.68984 32.05882
## r.rate b.rate
## 1 2.958976 1175.0000
## 2 2.775086 600.0000
## 3 1.911136 425.0000
## 4 1.834204 400.0000
## 5 1.717444 500.0000
## 6 13.792084 209.6154
hc_loans2 = hc_loans[hc_loans$funded_timediff >= 0 & hc_loans$paid_timediff >= 0 , ]
hc_loans3 = hc_loans2[hc_loans$posted_datetime > strptime("2007-01-01", format="%Y-%m-%d"), ]
hc_loans3_json = jsonlite::toJSON(hc_loans3)
write(hc_loans3_json, file="hc_loans3.json")